This notebook contains all code for parsing and analysis the protein clusters generated using MMSeq2.
To explore the degree of sequence divergencen and similarity in each of the clusters created by MMSeq2, all-versus-all BLASTP analysis was performed. In every instance, the script run_blastp.py from the Python package pyrewton DOI:10.5281/zenodo.3876218) was used to run the BLASTP all-versus-all analysis.
The table compiled by BLASTP contains the following columns: - qseqid - sseqid - pident - length - mismatch - gapopen - qstart - qend - sstart - send - evalue - bitscore
In order to compare each of the pair-wise-alignment we need to calculate the Blast Score Ratio (SCR) to normalise for length. This was first presented by Rasko et al., 2005.
The bitscore reported by BLAST is the sum of the qualities of the aligned symbols over the whole alignment. This is an accurate measure of the alignment strength, but long sequences tend to have higher bitscores than short sequences, even when the matches are of about the same quality. To correct for this length effect, we can calculate a normalised bitscore where:
normalised bitscore = bitscore / query length
A representative sequence from each of the 4 largest clusters compiled using MMSeq2 (with a percentage identity and coverage of cut-off of 70%) was extracted from a local CAZyme database using cazy_webscraper.
Each representative sequence was identified by using the GenBank accession assigned as the name of the cluster by MMSeq2.
Table 3.1 presents the raw output from BLASTP as well as the BSR from the all-verus-all BLASTP analysis of the representative sequences from the 4 largest clusters compiled by MMSeq2.
| qseqid | sseqid | pident | cov | qlen | slen | alen | bitscore | evalue | bsr |
|---|---|---|---|---|---|---|---|---|---|
| CBK69950.1 | CBK69950.1 | 100.000 | 100 | 539 | 539 | 539 | 1121 | 0 | 2.0797774 |
| CBK69950.1 | AGE22437.1 | 49.171 | 100 | 539 | 567 | 543 | 520 | 0 | 0.9647495 |
| CBK69950.1 | CDG29680.1 | 49.631 | 99 | 539 | 533 | 542 | 520 | 0 | 0.9647495 |
| CBK69950.1 | QJR11213.1 | 44.383 | 99 | 539 | 534 | 543 | 412 | 0 | 0.7643785 |
| QJR11213.1 | QJR11213.1 | 100.000 | 100 | 534 | 534 | 534 | 1107 | 0 | 2.0730337 |
| QJR11213.1 | CDG29680.1 | 47.532 | 99 | 534 | 533 | 547 | 477 | 0 | 0.8932584 |
| QJR11213.1 | AGE22437.1 | 49.358 | 99 | 534 | 567 | 545 | 473 | 0 | 0.8857678 |
| QJR11213.1 | CBK69950.1 | 44.954 | 99 | 534 | 539 | 545 | 417 | 0 | 0.7808989 |
| CDG29680.1 | CDG29680.1 | 100.000 | 100 | 533 | 533 | 533 | 1118 | 0 | 2.0975610 |
| CDG29680.1 | AGE22437.1 | 66.417 | 99 | 533 | 567 | 533 | 753 | 0 | 1.4127580 |
| CDG29680.1 | CBK69950.1 | 49.631 | 99 | 533 | 539 | 542 | 520 | 0 | 0.9756098 |
| CDG29680.1 | QJR11213.1 | 47.091 | 99 | 533 | 534 | 550 | 471 | 0 | 0.8836773 |
| AGE22437.1 | AGE22437.1 | 100.000 | 100 | 567 | 567 | 567 | 1179 | 0 | 2.0793651 |
| AGE22437.1 | CDG29680.1 | 66.417 | 94 | 567 | 533 | 533 | 753 | 0 | 1.3280423 |
| AGE22437.1 | CBK69950.1 | 49.171 | 94 | 567 | 539 | 543 | 520 | 0 | 0.9171076 |
| AGE22437.1 | QJR11213.1 | 49.088 | 93 | 567 | 534 | 548 | 469 | 0 | 0.8271605 |
Figure 3.1 is an interactive plot presenting the BSR from the all-verus-all BLASTP analysis of the representative sequences from the 4 largest clusters compiled by MMSeq2.
To view the specific BSR for each comparison, hover over the plot and a tooltip will appear and will present the GenBank accessions of the corresponding proteins as well as the specific BSR value (to 3dp).
Figure 3.1: One-dimensional scatter plot of specificity scores of CAZyme and non-CAZyme predictions per test set, overlaying box plot of standard deviation.
The cluster CBK69950.1 contained 33 protein sequences, QJR11213.1 28 protein sequences, CDG29680.1 17 protein sequences, and AGE22437.1 contaiend 13 protein sequences.
The BSR infer that the two smaller clusters (CDG29680.1 and AGE22437.1) could potentially be combined to create a large cluster. The proteins in this new combined clusters could then be aligned to create an multisequence alignment (MSA) of functionally relevant proteins for molecular modeling. Conversely, the BSR inferred the two larger clusters should be kept separate ( and not combined) to create a high quality MSA (i.e. minimal gaps and readily identifiable consensus sequences) of the proteins within each cluster.
To explore the sequence divergence in each of the 4 largest clusters created by MMSeq2, for each cluster the protein sequences were retrieved using cazy_webscraper and were written to a FASTA file. The protein sequences were then compared to one another in a BLASTP all-versus-all analysis.
Figure 4.1 presents the heatmap of the all-versus-all BLASTP analysis of the 13 protein sequences in the AGE22437.1 cluster. The sequence BAM46395.1 has the least sequence similarity to all other members of the cluster. Overall, the cluster has a relatively similar sequence as inferred from the BSR.
Figure 4.1: Heatmap of all-versus-all BLASTP of the AGE22437.1 cluster
Figure 4.2 presents the heatmap of the all-versus-all BLASTP analysis of the 33 protein sequences in the CBK6950.1 cluster. Except for 2 instances, every pair-wise alignment produced a BSR of greater than 1.3 with many with a BRS greater than 1.5. This infers a strong sequence similarity across all proteins in the cluster. However,
- CBL10126.1 against VEI47713.1
- CBL13352.1 against VEI47713.1
- VEI47713.1 against CBL10126.1
- VEI47713.1 against CBL13352.1
produced BRS of 0.04.
Figure 4.2: Heatmap of all-versus-all BLASTP of the CBK6950.1 cluster
Figure 4.3 presents the heatmap of the all-versus-all BLASTP analysis of the 17 protein sequences in the CDG29680.1 cluster. As with CBK6950.1, overall all proteins in the cluster have a similar protein sequence to one another as inferred from the high (greater than 1.4) BSR. However, there are a few instances of pairwise alignments with BSRs of less than 0.5.
Figure 4.3: Heatmap of all-versus-all BLASTP of the CDG29680.1 cluster
Figure 4.4 presents the heatmap of the all-versus-all BLASTP analysis of the 28 protein sequences in the QJR11213.1 cluster. The heatmap shows clusters of proteins with high sequence similarity, as inferred from the high (greater than 1.8) BSR. AK102785.1, QJR11213.1 and QJR15254.1 both relatively lower BSRs of approximately 1.4 against all other proteins in the cluster. However, a BSR of 1.4 is still a relatively high BSR, therefore, there is relatively little protein sequence variation in this cluster.
Figure 4.4: Heatmap of all-versus-all BLASTP of the QJR11213.1 cluster
Figure 4.1 inferred that the clusters AGE22437.1 and CDG29680.1 could potentially be combined. To further explore this the a BLASTP all-versus-all analysis of all proteins in both clusters was performed. The BSRs from this analysis are presented in figure @ref(fig:AGE22437CDG29680}, and the plot demonstrates that the vast majoirty of pair-wise alignements produced a BSR of greater than 1.4. This infers relatively sequence divergence between all proteins across the two clusters, and therefore, the two clusters can be combined.
Figure 5.1: Heatmap of all-versus-all BLASTP of the proteins from both AGE22437.1 and CDG29680.1 clusters
To explore the possibility of creating a MSA with a consensus across the majority of proteins when aligning proteins from all 4 clusters (AGE22437.1, CBK6950.1, CDG29680.1 and QJR11213.1), a all-versus-all BLASTP analysis of all proteins in all 4 clusters was performed. The BSR of all pairwise alignments are presented in figure ??. Clear clusters of proteins with high BSRs (greater than 1.5) were observed, and these clusters included proteins from across multiple clusters. The majority of pairwise alignments scores a BSR of 1 and greater. Consequently, it was hypothesised that all 4 clusters could be combined to generate a MSA of functionally relevant proteins for molecular modeling.
Figure 5.2: Heatmap of all-versus-all BLASTP of the proteins from both AGE22437.1, CBK6950.1, CDG29680.1 and QJR11213.1 clusters
The total number of proteins in across all 4 clusters was 91. This included 0 proteins with PDB accessions listed in UniProt. The following SQL command was used to retrive the results:
WITH Ec_Query (ec_gbk_acc) AS (
SELECT DISTINCT Genbanks.genbank_accession
FROM Genbanks
INNER JOIN Genbanks_Ecs ON Genbanks.genbank_id = Genbanks_Ecs.genbank_id
INNER JOIN Ecs ON Genbanks_Ecs.ec_id = Ecs.ec_id
WHERE Ecs.ec_number = '3.2.1.37'
)
SELECT DISTINCT Genbanks.genbank_accession, Pdbs.pdb_accession
FROM Genbanks
INNER JOIN Pdbs ON Genbanks.genbank_id = Pdbs.genbank_id
LEFT JOIN Ec_Query ON Genbanks.genbank_accession = Ec_Query.ec_gbk_acc
WHERE (Genbanks.genbank_accession IN Ec_Query)
To further expand the pool of potentially functionally relevant proteins, the proteins from the CAZy families of interest (listed below) which were not included in the clusters were BLASTP queries against the members of the 4 clusters.
Glycoside Hydrolase CAZy families of interest:
- GH1
- GH2
- GH3
- GH11
- GH26
- GH30
- GH43
- GH51
- GH52
- GH54
- GH116
- GH120
From HMMER3, hmmbuild was used to construct a pHMM model of representing the 91 protein sequences from the 4 clusters of interest. The protein sequences from the CAZy families of interest that were no in the 4 clusters of interset were queried against the pHMM model to find additional potentially functionally relevant proteins.
59 protein sequences were found and added to the protein pool of 91 protein sequences from the 4 clusters of interest, creating an expanded protein pool of 150 protein sequences. A BLASTP all-vs-all analysis was performed to explore the sequence diversity across the expanded protein sequence pool. The output of this is shown in figure @ref{fig:expandedPool}.
Figure 6.1: Heatmap of BSR of BLASTP all-vs-all pairwise alignemnts of all 150 proteins in the expanded protein pool
The majority of pairwise alignments produced a BSR of between 0.7 and 0.8. Distinct small clusters of pairwise alignments produced very high BRS of greater than 1.8. Very few pairwise alignments produced a BSR of less than 0.7. Altogether the results inferred a relatively higher level of sequence conservation across the entire expanded protein pool and a MSA with minimal gaps should be producible from the expanded protein pool.